Week 7 Time Series Classification¶

In [108]:
import numpy as np
from sktime.datasets import load_acsf1
from sklearn.model_selection import train_test_split

# Loading the dataset
X, y = load_acsf1()

Part 1: 35 pts Understanding the dataset

A. 7pts Give a verbal description of the dataset from information on the acsf1 detailed webpage, not the summary repository page.

Ans: The ACS-F1 dataset, or Appliance Consumption Signatures - First Version, is a comprehensive collection of power consumption data for various household appliances.

It contains power consumption readings for typical household appliances, divided into 10 categories: mobile phones (via chargers), coffee machines, computer stations (including monitors), fridges and freezers, Hi-Fi systems (CD players), lamps (CFL), laptops (via chargers), microwave ovens, printers, and televisions (LCD or LED).

There are 100 training and 100 test instances, each a time series of length 1460, with no missing values. The dataset is intended for time series classification tasks, particularly identifying which type of appliance is consuming power at any given time.

The dataset was later edited by Patrick Schäfer and Ulf Leser, who used it in their research on time series classification methods, specifically in their paper on the WEASEL (Word ExtrAction for time SEries cLassification) algorithm.

B. 7pts There are 1460 time steps in each observation. Use len() to display this for any observation in the X_train.

In [109]:
import numpy as np
from sktime.datasets import load_acsf1
from sklearn.model_selection import train_test_split

# Loading the dataset
X, y = load_acsf1()

X_train, y_train = load_acsf1(split="train", return_X_y=True)
X_test, y_test = load_acsf1(split="test", return_X_y=True)

# Displaying the length of time steps for the first observation in X_train
observation_length = len(X_train.iloc[0, 0])
print(f'The length of time steps for the first observation in X_train is: {observation_length}')
The length of time steps for the first observation in X_train is: 1460

C. 7pts Return the counts of classes in y_train

In [110]:
# examining y_train counts
labels, counts = np.unique(y_train, return_counts=True)
print(labels, counts)
['0' '1' '2' '3' '4' '5' '6' '7' '8' '9'] [10 10 10 10 10 10 10 10 10 10]

D. 7pts Plot the first time series for each class(.iloc[0]), label each plot with its specified class name. The patterns should match what can be found on the acsf1 detailed webpage.

In [111]:
import matplotlib.pyplot as plt
from sktime.datasets import load_acsf1
from sklearn.model_selection import train_test_split
import pandas as pd

# Loading the dataset
X, y = load_acsf1()

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Converting y_train to a pandas Series for easier indexing
y_train = pd.Series(y_train)

# Getting unique classes from y_train
unique_classes = y_train.unique()

# Defining a color map (plt.cm.get_cmap is deprecated since Matplotlib 3.7)
import matplotlib
colors = matplotlib.colormaps['tab10']

# Creating subplots for each class
fig, axes = plt.subplots(len(unique_classes), 1, figsize=(30, len(unique_classes) * 5), sharex=True)

# Plotting the first time series for each class
for i, label in enumerate(unique_classes):
    # Select the first instance of the current class in y_train
    first_instance_idx = y_train[y_train == label].index[0]
    first_instance = X_train.iloc[first_instance_idx]["dim_0"]
    
    # Plotting the time series
    axes[i].plot(first_instance, color=colors(i), label=f"class {label}")
    axes[i].legend()
    axes[i].set_title(f'Class {label}')
    axes[i].set_ylabel('Power Consumption')
    axes[i].grid(True)

# Setting a common x-axis label
axes[-1].set_xlabel('Time Steps')

plt.tight_layout()
plt.show()
[Plot: first time series for each of the 10 classes, one subplot per class]

E. 7pts Each observation is 10 seconds apart. Describe what the plots show for classes 3, 8 and 9. Give some intuition about what appliance each of these three classes might represent.

Class 3: High, intermittent spikes in power consumption, likely a microwave oven.

Class 8: Shows clusters of moderate activity with periods of inactivity, possibly a printer.

Class 9: Displays a consistent, low-level power consumption pattern, suggesting a small fridge or freezer.

Part 2: 15 pts Description of Time Series Classification models

A. 5 pts Select one classification model type. Describe how the model works. Why would each be a good or bad fit for this type of data?

Random Forest is a machine learning model that builds multiple decision trees and combines their predictions. It handles complex data well and is resistant to overfitting.

It would be a good fit because it can handle many features (time points) effectively, captures complex relationships between features, is robust to noise, and provides feature importance.

It might be a bad fit because it is difficult to interpret how it makes decisions, it can be computationally expensive, and it ignores the order of data points (time dependence).
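As a minimal sketch of this approach (using random stand-in data in place of the flattened ACSF1 series so the snippet is self-contained; `X_flat`, `y_demo`, and the shapes are illustrative assumptions), a Random Forest can be fit directly on the raw time points as features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: 20 "series" of 1460 time points each (random, for illustration only)
rng = np.random.default_rng(42)
X_flat = rng.normal(size=(20, 1460))
y_demo = rng.integers(0, 2, size=20)

# Each time point becomes one feature; the forest ignores their temporal order
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_flat, y_demo)
print(rf.predict(X_flat).shape)  # one predicted label per series
```

Because each of the 1460 time points is treated as an independent feature, any temporal structure is ignored, which is exactly the weakness noted above.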

B. 5 pts Select a second classification model type. Describe how the model works. Why would each be a good or bad fit for this type of data?

XGBoost (Extreme Gradient Boosting): It is an ensemble learning method that sequentially builds an ensemble of weak decision trees. Each tree is trained to correct the mistakes of the previous ones, resulting in a powerful and accurate model. Key to XGBoost's performance is its gradient boosting framework, which optimizes the loss function at each iteration, and its use of regularization to prevent overfitting.

XGBoost is designed for numerical data, which aligns well with the power consumption values in the ACSF1 dataset. It can also be tuned to handle imbalanced class distributions, although the class counts in 1C show this dataset is balanced (10 instances per class). It is known for its speed and efficiency, making it suitable for the large number of features here, and it can provide insights into which time series features are most important for classification, aiding in understanding appliance characteristics.

C. 5 pts Select third classification model type. Describe how the model works. Why would each be a good or bad fit for this type of data?

Support Vector Machines (SVM):

SVM finds the optimal hyperplane that separates data points into different classes with the maximum margin. For non-linearly separable data, kernel functions are used to map the data into a higher-dimensional space where linear separation becomes possible.

Good fit for ACSF1 data: SVM can effectively handle the large number of features (time points) in the dataset. Also, for relatively small datasets like this one, SVM can be computationally efficient.

It might be a bad fit because it is sensitive to outliers, and it may not adapt well to time series data because it does not capture temporal dependencies.
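A minimal sketch of this model (again with random stand-in data rather than the real ACSF1 series; `X_flat` and `y_demo` are illustrative assumptions). Feature scaling matters for SVMs, so a pipeline with StandardScaler is the usual idiom:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: 30 flattened "series" of 1460 points, 3 classes (random, illustration only)
rng = np.random.default_rng(0)
X_flat = rng.normal(size=(30, 1460))
y_demo = rng.integers(0, 3, size=30)

# Scale each feature, then fit an RBF-kernel SVM
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_flat, y_demo)
print(svm.predict(X_flat[:5]))  # predicted labels for the first five series
```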

Part 3: 50 pts Select one method. Model and examine results

In [103]:
pip install xgboost
Requirement already satisfied: xgboost in c:\apps\anaconda\lib\site-packages (2.1.1)
Requirement already satisfied: numpy in c:\apps\anaconda\lib\site-packages (from xgboost) (1.26.4)
Requirement already satisfied: scipy in c:\apps\anaconda\lib\site-packages (from xgboost) (1.11.4)
Note: you may need to restart the kernel to use updated packages.
In [116]:
import numpy as np
import pandas as pd
from sktime.datasets import load_acsf1
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

def segment_time_series(X, y, segment_length, overlap):
    X_segments = []
    y_segments = []
    for i in range(len(X)):
        series = np.array(X.iloc[i, 0])
        label = y[i]
        for start in range(0, len(series) - segment_length + 1, segment_length - overlap):
            end = start + segment_length
            X_segments.append(series[start:end])
            y_segments.append(label)
    return np.array(X_segments), np.array(y_segments)

# Loading the dataset
X, y = load_acsf1()

# Ensuring the class labels are integers
y = y.astype(int)

# Defining the segment length and overlap
segment_length = 400
overlap = 200

# Segmenting the time series data
X_segments, y_segments = segment_time_series(X, y, segment_length, overlap)

# Converting the segmented data to a suitable format for XGBoost
X_flattened = np.array([segment.flatten() for segment in X_segments])

# Splitting the flattened segments into training and test sets (70/30,
# which gives the 840/360 instance counts reported below)
X_train, X_test, y_train, y_test = train_test_split(
    X_flattened, y_segments, test_size=0.3, random_state=42
)

# Defining the XGBoost model after hyperparameter tuning
xgb_model = XGBClassifier(
    colsample_bytree=0.8053457692015644,
    learning_rate=0.2520738931152338,
    max_depth=6,
    min_child_weight=1,
    n_estimators=353,
    subsample=0.6735406596951892,
    eval_metric='mlogloss',
    random_state=42
)

# Fitting the model on the training data
xgb_model.fit(X_train, y_train)
Out[116]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.8053457692015644, device=None,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric='mlogloss', feature_types=None, gamma=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.2520738931152338,
              max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=6, max_leaves=None,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=353, n_jobs=None,
              num_parallel_tree=None, objective='multi:softprob', ...)
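As a quick sanity check on the window arithmetic above (segment_length=400, overlap=200, series length 1460), each series yields 6 overlapping segments, so the 200 ACSF1 series produce 1200 segments in total, consistent with the 840/360 train/test counts reported in part B:

```python
# Window starts advance by (segment_length - overlap)
series_length, segment_length, overlap = 1460, 400, 200
starts = range(0, series_length - segment_length + 1, segment_length - overlap)
n_per_series = len(starts)
print(n_per_series)        # 6 segments per series
print(200 * n_per_series)  # 1200 segments over 200 series (840 train + 360 test at 70/30)
```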

B. 10 pts Return the accuracy score of the train set and test set (suggestion to use .score()). Print the confusion matrix and classification report of the test set.

In [118]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


# Printing the number of instances in the train and test sets
print(f"Number of instances in the training set: {X_train.shape[0]}")
print(f"Number of instances in the test set: {X_test.shape[0]}")

# Ensuring data type is numeric
X_train = X_train.astype(float)
X_test = X_test.astype(float)

# Getting the accuracy score for train and test sets
train_accuracy = xgb_model.score(X_train, y_train)
test_accuracy = xgb_model.score(X_test, y_test)

print(f'Train Accuracy: {train_accuracy}')
print(f'Test Accuracy: {test_accuracy}')

# Making predictions on the test data
y_pred = xgb_model.predict(X_test)

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(conf_matrix)

# Visualizing the confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Printing the classification report
classification_rep = classification_report(y_test, y_pred)
print('Classification Report:')
print(classification_rep)
Number of instances in the training set: 840
Number of instances in the test set: 360
Train Accuracy: 1.0
Test Accuracy: 0.8972222222222223
Confusion Matrix:
[[33  0  0  0  0  0  2  0  0  0]
 [ 0 29  2  0  0  3  0  0  0  0]
 [ 1  0 34  0  0  3  0  0  0  0]
 [ 0  0  0 33  0  0  0  0  0  0]
 [ 0  4  4  0 28  0  0  0  0  0]
 [ 0  0  2  0  0 28  0  1  0  7]
 [ 1  0  2  0  0  0 39  1  0  0]
 [ 0  0  0  0  0  1  0 30  0  0]
 [ 0  0  0  2  0  0  0  0 31  0]
 [ 0  0  0  0  0  1  0  0  0 38]]
[Plot: confusion matrix heatmap]
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.94      0.94        35
           1       0.88      0.85      0.87        34
           2       0.77      0.89      0.83        38
           3       0.94      1.00      0.97        33
           4       1.00      0.78      0.88        36
           5       0.78      0.74      0.76        38
           6       0.95      0.91      0.93        43
           7       0.94      0.97      0.95        31
           8       1.00      0.94      0.97        33
           9       0.84      0.97      0.90        39

    accuracy                           0.90       360
   macro avg       0.90      0.90      0.90       360
weighted avg       0.90      0.90      0.90       360

C. 10 pts Discuss the precision score for class 8. Support this with your visual opinion from plots in 1D as well as the confusion matrix.

Ans : The precision score for class 8 is 1.00, as shown in the classification report. In the confusion matrix, the column for class 8 contains only the 31 correctly classified instances: no instance of any other class was predicted as class 8, so there are no false positives. The 1D plot for class 8 shows a distinctive pattern of noticeable peaks separated by intervals of low activity, and this signature is apparently not confused with any other class.

Note that class 8's recall (0.94) is slightly lower: two actual class-8 instances were misclassified as class 3. Those errors lower recall, not precision.
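Recomputing class 8's precision directly from the confusion matrix printed above (precision = true positives divided by the column sum, i.e. by everything predicted as class 8):

```python
import numpy as np

# Confusion matrix from part B (rows = actual class, columns = predicted class)
conf = np.array([
    [33, 0, 0, 0, 0, 0, 2, 0, 0, 0],
    [0, 29, 2, 0, 0, 3, 0, 0, 0, 0],
    [1, 0, 34, 0, 0, 3, 0, 0, 0, 0],
    [0, 0, 0, 33, 0, 0, 0, 0, 0, 0],
    [0, 4, 4, 0, 28, 0, 0, 0, 0, 0],
    [0, 0, 2, 0, 0, 28, 0, 1, 0, 7],
    [1, 0, 2, 0, 0, 0, 39, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 30, 0, 0],
    [0, 0, 0, 2, 0, 0, 0, 0, 31, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 38],
])
precision_8 = conf[8, 8] / conf[:, 8].sum()  # 31 / 31
print(precision_8)  # 1.0
```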

D. 10 pts Discuss the recall score for class 5. Support this with your visual opinion from plots in 1D as well as the confusion matrix.

Ans : The recall score for class 5 is 0.74.

True Positives: 28; False Negatives: 10 (out of 38 actual class-5 instances).

The model correctly identified about 74% of actual Class 5 instances (28 out of 38), the lowest recall of any class, though still a reasonable performance.

Class 5 patterns exhibit rapid fluctuations between -1 and 1.7, creating a distinct profile. However, this pattern overlaps with several other classes: the confusion matrix shows 7 instances misclassified as class 9, 2 as class 2, and 1 as class 7, which is consistent with the visual similarity of those patterns in the 1D plots.
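The same arithmetic applied to the confusion matrix row for class 5 (recall = true positives divided by the row sum, i.e. by all actual class-5 instances):

```python
import numpy as np

# Row for actual class 5 from the confusion matrix in part B (predicted classes 0-9)
row_5 = np.array([0, 0, 2, 0, 0, 28, 0, 1, 0, 7])
recall_5 = row_5[5] / row_5.sum()  # 28 / 38
print(round(recall_5, 3))  # 0.737
```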

E. 10 pts Which metric do you feel is the most important in the following business case: You work for ComEd, a local electricity supplier. You head a department that uses analytics to plan electrical supply for Chicago's power grid. Assume that your department budgets for a certain amount of electrical supply at a fixed low rate. If the total demand in Chicago stays within the purchased supply levels, your department is performing. If the demand breaches this supply level, the company is penalized and the rate for your supply multiplies by 100x, destroying your department’s performance. If you had to build your forecast model to classify patterns of high electrical usage (appliances, air conditioning, water heating) vs low electrical usage (lighting, tv, phone chargers) which metric (precision or recall) would you use?

Ans : I would use recall, although both are important in this case. Recall matters most because it minimizes the risk of underestimating demand and incurring the hefty 100x penalty, which would destroy our department's performance.

Precision also plays an important role in reducing false positives and avoiding unnecessary costs. I would build a model that excels at identifying high-demand periods to minimize the risk of underestimating supply.

I would also implement a safety margin. By balancing precision and recall, we can create a robust model that effectively manages electrical supply while minimizing both financial risks and operational inefficiencies.

THE END!¶